ABSTRACT

Recently  published  studies  have  shown  that  partitional  clustering 
algorithms  that  optimize  certain  criterion  functions,  which  measure 
key  aspects  of  inter-  and  intra-cluster  similarity,  are  very  effective 
in  producing  hard  clustering  solutions  for  document  datasets  and 
outperform  traditional  partitional  and  agglomerative  algorithms.  In 
this  paper  we  study  the  extent  to  which  these  criterion  functions 
can  be  modified  to  include  soft  membership  functions  and  whether 
or  not  the  resulting  soft  clustering  algorithms  can  further  improve 
the  clustering  solutions.  Specifically,  we  focus  on  four  of  these  hard 
criterion  functions,  derive  their  soft-clustering  extensions,  present  a 
comprehensive  experimental  evaluation  involving  twelve  different 
datasets,  and  analyze  their  overall  characteristics.  Our  results  show 
that  introducing  softness  into  the  criterion  functions  tends  to  lead  to 
better clustering results for most datasets and consistently improves
the  separation  between  the  clusters. 

1.  INTRODUCTION

Fast and high-quality document clustering algorithms play an important
role in helping users to effectively navigate, summarize,
and organize the enormous amount of text documents available on
the Internet, digital libraries, news sources, and company-wide intranets.
Over the years a variety of different algorithms have been

*This work was supported in part by NSF CCR-9972519, EIA-9986042,
ACI-9982274, ACI-0133464, and ACI-0312828; the
Digital Technology Center at the University of Minnesota; and
by the Army High Performance Computing Research Center (AHPCRC)
under the auspices of the Department of the Army, Army
Research Laboratory (ARL) under Cooperative Agreement number
DAAD19-01-2-0014. The content does not necessarily
reflect the position or the policy of the government, and no official
endorsement should be inferred. Access to research and computing
facilities was provided by the Digital Technology Center and the
Minnesota Supercomputing Institute.


George  Karypis 

University  of  Minnesota,  Department  of 
Computer  Science  and  Engineering, 
Digital  Technology  Center,  and  Army  HPC 
Research  Center,  Minneapolis,  MN  55455 

karypis@cs.umn.edu 


developed. These algorithms can be categorized along different dimensions
based either on the underlying methodology of the algorithm,
leading to agglomerative [36, 24, 15, 16, 22] or partitional
approaches [28, 20, 32, 8, 41, 19, 38, 7, 13], or on the nature of the
membership function, leading to hard (crisp) or soft (fuzzy) solutions.

In recent years, soft clustering algorithms have been studied in document
clustering and shown to be effective [29, 25, 30] in finding
both overlapping and non-overlapping clusters. Studies have
shown that "hardening" the results obtained by fuzzy C-means produces
better hard clustering solutions than direct K-means [17],
which suggests that including soft membership functions into other
criterion functions may lead to better hard clustering solutions as
well.

Recently, we studied seven different hard partitional clustering criterion
functions in the context of document clustering, which optimize
various aspects of intra-cluster similarity, inter-cluster dissimilarity,
and their combinations [45, 44, 46]. Our experiments
showed that different criterion functions lead to substantially different
results, whereas our analysis showed that their performance
depends on the degree to which they can correctly operate when
the dataset contains clusters of different densities (i.e., they contain
documents whose pairwise similarities are different) and the degree
to which they can produce balanced clusters. We also showed that
among these seven criterion functions, there is a set of criterion
functions that consistently outperform the rest.

The focus of this paper is to extend four of these hard criterion functions
($\mathcal{I}_1$, $\mathcal{I}_2$, $\mathcal{E}_1$, $\mathcal{G}_1$ [45]) to allow soft membership functions, and
to see whether or not introducing softness into these criterion functions
leads to better clustering solutions. These criterion functions
were selected because they include some of the best- and worst-performing
schemes, and represent some of the most widely-used
criterion functions for document clustering. We developed a hard-clustering
based optimization algorithm that optimizes the various
soft criterion functions. Since this optimization algorithm simultaneously
produces both a hard and a soft clustering solution, we
focused on evaluating the hard clustering solution and compared it
with the one obtained by the hard criterion functions. We present
a comprehensive experimental evaluation involving twelve different
datasets. Our experimental results show that introducing softness
into the criterion functions tends to consistently improve the
separation between the clusters. Although the experimental results
show some dataset dependency, for most datasets the soft criterion
functions tend to lead to better clustering results. Moreover, our
experimental results show that the soft clustering extension of the
worst-performing hard criterion function ($\mathcal{I}_1$) achieves the best relative
improvement.


used soft clustering algorithms. It is a soft version of the K-means
algorithm that uses a soft membership function. Given a set of objects
$d_1, d_2, \ldots, d_N$, the fuzzy C-means algorithm tries to optimize
a least-squared error criterion:


The rest of this paper is organized as follows. Section 2 provides
some information on how documents are represented and how the
similarity or distance between documents is computed. Section 3
discusses some existing soft clustering algorithms related to our
work. Sections 4 and 5 describe the four hard criterion functions
that are the focus of this paper and present their soft clustering
extensions, respectively. Section 6 describes the algorithm that optimizes
the various soft criterion functions and the clustering algorithm
itself. Section 7 provides the detailed experimental evaluation
of the various soft criterion functions. Section 8 discusses
some important observations from the experimental results. Finally,
Section 9 provides some concluding remarks and future research
directions.


2.  PRELIMINARIES 

Throughout this paper we will use the symbols $n$, $m$, and $k$ to denote
the number of documents, the number of terms, and the number
of clusters, respectively. We will use the symbol $S$ to denote
the set of $n$ documents that we want to cluster, $S_1, S_2, \ldots, S_k$ to
denote each one of the $k$ clusters, and $n_1, n_2, \ldots, n_k$ to denote the
sizes of the corresponding clusters.

We represent the documents using the vector-space model [35]. In
this model, each document $d$ is considered to be a vector in the
space of the distinct terms present in the collection. We employ
the tf-idf term-weighting scheme that represents each document $d$
as the vector

$$d_{tfidf} = (tf_1 \times idf_1,\; tf_2 \times idf_2,\; \ldots,\; tf_m \times idf_m).$$

In this scheme, $tf_i$ corresponds to the frequency of the $i$th term in
the document and $idf_i = \log(n/df_i)$ corresponds to its inverse document
frequency in the collection ($df_i$ is the number of documents
that contain the $i$th term). To account for documents of different
lengths, we scale the length of each document vector so that it is of
unit length.
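The weighting scheme above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: documents are given as lists of already-tokenized terms, vectors are stored as sparse dictionaries, and the function name `tfidf_unit_vectors` is hypothetical rather than taken from the paper.

```python
import math
from collections import Counter

def tfidf_unit_vectors(docs):
    """docs: list of token lists. Returns one sparse dict vector per document."""
    n = len(docs)
    # df[t] = number of documents that contain term t
    df = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # tf_i * idf_i with idf_i = log(n / df_i)
        vec = {t: f * math.log(n / df[t]) for t, f in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        # Scale to unit length; an all-zero vector is left unchanged.
        if norm > 0:
            vec = {t: w / norm for t, w in vec.items()}
        vectors.append(vec)
    return vectors

docs = [["soft", "cluster", "cluster"], ["hard", "cluster"], ["soft", "fuzzy"]]
vecs = tfidf_unit_vectors(docs)
```

Note that a term occurring in every document receives weight zero, since its idf is log(1) = 0.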


We measure the similarity between a pair of documents $d_i$ and
$d_j$ by taking the cosine of the angle formed between the tf-idf
representation of their vectors. Specifically, this is defined as

$$\cos(d_i, d_j) = \frac{d_i^t d_j}{\|d_i\| \, \|d_j\|},$$

which can be simplified to $\cos(d_i, d_j) = d_i^t d_j$, since the document
vectors are of unit length. This similarity measure becomes
one if the document vectors point in the same direction (i.e., they
contain an identical set of terms in the same relative proportion), and
zero if there is nothing in common between them (i.e., the vectors
are orthogonal to each other).


Finally, given a set $A$ of documents and their corresponding vector
representations, we define the composite vector $D_A$ to be $D_A =
\sum_{d \in A} d$, and the centroid vector $C_A$ to be $C_A = D_A / |A|$.
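The definitions of this section translate directly into code. The following minimal sketch assumes dense vectors stored as Python lists; the helper names (`dot`, `cosine`, `composite`, `centroid`) are illustrative and not part of the paper.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def cosine(u, v):
    """cos(d_i, d_j) = d_i^t d_j / (||d_i|| ||d_j||)."""
    return dot(u, v) / (norm(u) * norm(v))

def composite(vectors):
    """D_A: component-wise sum of the vectors in A."""
    return [sum(col) for col in zip(*vectors)]

def centroid(vectors):
    """C_A = D_A / |A|."""
    return [x / len(vectors) for x in composite(vectors)]

a = [0.6, 0.8, 0.0]   # unit length
b = [0.0, 0.6, 0.8]   # unit length
D = composite([a, b])
C = centroid([a, b])
```

For unit-length vectors such as `a` and `b`, `cosine(a, b)` and `dot(a, b)` agree, which is the simplification used throughout the paper.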

3.  RELATED  RESEARCH 

Soft clustering that allows an object to appear in multiple clusters
has been studied extensively and still remains of great interest, as
many datasets and application domains require soft clustering solutions.
The fuzzy C-means algorithm [3] is one of the most widely


$$\sum_{r=1}^{k} \left( \sum_{i=1}^{N} \mu_{i,r}^{m} \, \|d_i - C_r\|^2 \right),$$

where $\mu_{i,r}$ is $d_i$'s membership in the $r$th fuzzy cluster satisfying
$\sum_{r=1}^{k} \mu_{i,r} = 1$, $\forall i$, $m$ is the fuzzy factor, and $C_r$ is the fuzzy centroid.
This minimization problem can be solved analytically by using
Lagrange multipliers, and the optimization can be achieved by
iteratively updating the membership function and fuzzy centroids
as follows:


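$$\mu_{i,r} = \frac{\left(1/\|d_i - C_r\|^2\right)^{1/(m-1)}}{\sum_{j=1}^{k} \left(1/\|d_i - C_j\|^2\right)^{1/(m-1)}}, \qquad C_r = \frac{\sum_{i=1}^{N} \mu_{i,r}^{m} \, d_i}{\sum_{i=1}^{N} \mu_{i,r}^{m}}.$$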


The fuzzy factor $m$ controls the fuzziness of the clustering solution.
In general, the fuzziness of the clustering solution increases
as the value of $m$ increases and vice versa. As $m$ approaches one,
the algorithm behaves more like standard K-means.
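The alternating updates of fuzzy C-means can be illustrated with the following minimal sketch. It is not the implementation used in any of the cited work: the function names are hypothetical, the small constant `eps` is an added assumption that guards against a document coinciding with a centroid, and `m` must be greater than one.

```python
def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def fcm_memberships(docs, centroids, m=2.0, eps=1e-12):
    """mu[i][r] proportional to (1 / ||d_i - C_r||^2)^(1/(m-1)); rows sum to 1."""
    mu = []
    for d in docs:
        w = [(1.0 / (sq_dist(d, c) + eps)) ** (1.0 / (m - 1.0)) for c in centroids]
        total = sum(w)
        mu.append([x / total for x in w])
    return mu

def fcm_centroids(docs, mu, m=2.0):
    """C_r = sum_i mu[i][r]^m d_i / sum_i mu[i][r]^m."""
    k, dim = len(mu[0]), len(docs[0])
    cents = []
    for r in range(k):
        w = [mu[i][r] ** m for i in range(len(docs))]
        total = sum(w)
        cents.append([sum(w[i] * docs[i][j] for i in range(len(docs))) / total
                      for j in range(dim)])
    return cents

docs = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
cents = [[0.0, 0.5], [5.0, 5.5]]
for _ in range(10):                 # a fixed number of alternating updates
    mu = fcm_memberships(docs, cents)
    cents = fcm_centroids(docs, mu)
```

A practical implementation would stop when the memberships change by less than a tolerance instead of running a fixed number of iterations.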


Other newly developed soft clustering algorithms differ from fuzzy
C-means by employing different dissimilarity functions [4, 29], or
by including both a soft membership function and a weight function
(measuring the contribution of each object in a fuzzy cluster) in the
criterion functions (robust fuzzy C-means [21] and K-harmonic
means [43]). Hamerly and Elkan [17] provided an interesting comparison
between K-means, fuzzy C-means, K-harmonic means
and two other variants to show the effectiveness of various soft
membership functions and weight functions.

Soft clustering has been applied to document clustering and shown
to be effective [29, 25, 30]. One of the limitations of classic fuzzy
C-means in document clustering is the use of Euclidean distance.
Hence, the focus of that research has been on exploring similarity
measures that are more suitable for document clustering, for example,
cosine similarity (Mendes et al. [29]). To our knowledge,
extending other effective criterion functions (for example, $\mathcal{E}_1$ and
$\mathcal{G}_1$) with soft membership functions for document clustering has
not been studied in the literature.


Another approach to soft clustering, proposed by Backer, is induced
fuzzy partitioning [1]. The key idea is that a hard clustering solution
is always maintained in the optimization process. The optimization
process starts from an initial hard partition and consists of
a number of iterations. During each iteration, the soft membership
function is estimated based on the affinity that each object has for
the hard clusters to induce a soft partition. Then, the hard partition is
modified in such a way that the new induced soft partition leads to
a better criterion function value. The optimization stops when no
modification of the hard partition can be made. Notice that after
the optimization process terminates, there is a pair of clustering solutions:
a hard clustering solution and the induced soft one. Our
proposed optimization algorithm is similar to induced fuzzy partitioning
and we focus on evaluating the hard clustering solution
obtained by this optimization.


4.  HARD CLUSTERING CRITERION FUNCTIONS

A key characteristic of many partitional clustering algorithms is
that they use a global criterion function whose optimization drives
the entire clustering process. For some of these algorithms the criterion
function is implicit (e.g., PDDP [7]), whereas for other algorithms
(e.g., K-means [28] and Autoclass [8, 10]) the criterion
function is explicit and can be easily stated. This latter class of algorithms
can be thought of as consisting of two key components.
First is the criterion function that needs to be optimized by the clustering
solution, and second is the actual algorithm that achieves this
optimization.

Recently, we studied seven different hard partitional clustering criterion
functions in the context of document clustering, which optimize
various aspects of intra-cluster similarity, inter-cluster dissimilarity,
and their combinations [45, 44, 46]. Our experiments
showed that different criterion functions lead to substantially different
results, whereas our analysis showed that their performance
depends on the degree to which they can correctly operate when
the dataset contains clusters of different densities (i.e., they contain
documents whose pairwise similarities are different) and the degree
to which they can produce balanced clusters.


higher  in  the  overall  clustering  solution. 

$$\text{minimize} \quad \mathcal{E}_1 = \sum_{r=1}^{k} n_r \cos(C_r, C). \tag{3}$$


The $\mathcal{G}_1$ criterion function (Equation 4) is derived by modeling the
relationships between the documents using the document-to-document
similarity graph $G_S$ [2, 42, 11]. $G_S$ is obtained by treating the
pairwise similarity matrix of the dataset as the adjacency matrix of
$G_S$. The $\mathcal{G}_1$ function [13] views the clustering process as that of
partitioning the documents into groups that minimize the edge-cut
of each partition. However, because this edge-cut-based criterion
function may have trivial solutions, the edge-cut of each cluster is
scaled by the sum of the cluster's internal edges [13]. Since the
similarity between each pair of documents is measured using the
cosine function, the edge-cut between the $r$th cluster and the rest
of the documents (i.e., $cut(S_r, S - S_r)$) and the sum of the internal
edges between the documents of the $r$th cluster are given by the
numerator and denominator of Equation 4, respectively.


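$$\text{minimize} \quad \mathcal{G}_1 = \sum_{r=1}^{k} \frac{\sum_{d_i \in S_r,\, d_j \in S - S_r} \cos(d_i, d_j)}{\sum_{d_i, d_j \in S_r} \cos(d_i, d_j)}. \tag{4}$$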
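Equations 3 and 4 can be evaluated without enumerating document pairs, because for unit-length vectors the sums of pairwise cosines reduce to dot products of composite vectors. The sketch below relies on that observation and on one assumption: the sums in Equation 4 run over ordered pairs and include the pairs with $i = j$. The function names `e1` and `g1` are illustrative.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def composite(vectors):
    return [sum(col) for col in zip(*vectors)]

def e1(clusters):
    """Equation 3: sum_r n_r * cos(C_r, C); cos(C_r, C) equals cos(D_r, D)."""
    D = composite([d for c in clusters for d in c])
    total = 0.0
    for c in clusters:
        Dr = composite(c)
        total += len(c) * dot(Dr, D) / math.sqrt(dot(Dr, Dr) * dot(D, D))
    return total

def g1(clusters):
    """Equation 4 for unit-length documents.

    cut(S_r, S - S_r) = D_r . (D - D_r); internal edge sum = D_r . D_r.
    """
    D = composite([d for c in clusters for d in c])
    total = 0.0
    for c in clusters:
        Dr = composite(c)
        rest = [x - y for x, y in zip(D, Dr)]
        total += dot(Dr, rest) / dot(Dr, Dr)
    return total

x, y = [1.0, 0.0], [0.0, 1.0]
good = [[x, x], [y, y]]   # each cluster holds identical documents
bad = [[x, y], [x, y]]    # each cluster mixes the two directions
```

Both functions are minimized, so the well-separated partition `good` should score lower than `bad` on each of them.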


In this paper, due to space constraints, we focus on only four out of
these seven criterion functions, which are referred to as $\mathcal{I}_1$, $\mathcal{I}_2$,
$\mathcal{E}_1$, and $\mathcal{G}_1$ [45, 46]. This subset represents some of the most
widely-used criterion functions for document clustering, and includes
some of the best- and worst-performing schemes. A short
description of these functions is presented in the rest of this section,
and the reader should consult [45] for a complete description
and motivation.

The $\mathcal{I}_1$ criterion function (Equation 1) maximizes the sum of the
average pairwise similarities (as measured by the cosine function)
between the documents assigned to each cluster, weighted according
to the size of each cluster, and has been used successfully for
clustering document datasets [34].


$$\text{maximize} \quad \mathcal{I}_1 = \sum_{r=1}^{k} n_r \left( \frac{1}{n_r^2} \sum_{d_i, d_j \in S_r} \cos(d_i, d_j) \right). \tag{1}$$


The $\mathcal{I}_2$ criterion function (Equation 2) is used by the popular vector-space
variant of the K-means algorithm (also referred to as spherical
K-means) [9, 26, 12, 37]. In this algorithm each cluster is
represented by its centroid vector and the goal is to find the solution
that maximizes the similarity between each document and the
centroid of the cluster that it is assigned to.

$$\text{maximize} \quad \mathcal{I}_2 = \sum_{r=1}^{k} \sum_{d_i \in S_r} \cos(d_i, C_r). \tag{2}$$
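As a concrete illustration of Equations 1 and 2, the sketch below evaluates both functions for a hard partition given as a list of clusters, each a list of unit-length vectors. The function names `i1` and `i2` are illustrative, and the pairwise sum in Equation 1 is assumed to include the pairs with $i = j$.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def composite(vectors):
    return [sum(col) for col in zip(*vectors)]

def i1(clusters):
    """Equation 1 via pairwise cosines (documents are assumed unit length)."""
    total = 0.0
    for c in clusters:
        n = len(c)
        pair_sum = sum(dot(a, b) for a in c for b in c)
        total += n * pair_sum / (n * n)
    return total

def i2(clusters):
    """Equation 2: each document's cosine with its own cluster centroid."""
    total = 0.0
    for c in clusters:
        Dr = composite(c)                  # same direction as the centroid
        norm = math.sqrt(dot(Dr, Dr))
        total += sum(dot(d, Dr) / norm for d in c)
    return total

x, y = [1.0, 0.0], [0.0, 1.0]
good = [[x, x], [y, y]]   # each cluster holds identical documents
bad = [[x, y], [x, y]]    # each cluster mixes the two directions
```

Both functions are maximized, so `good` should score higher than `bad`. For unit-length documents the two values also equal $\sum_r \|D_r\|^2 / n_r$ and $\sum_r \|D_r\|$ respectively, which avoids the quadratic pairwise loop.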


The $\mathcal{E}_1$ criterion function (Equation 3) computes the clustering by
finding a solution that separates the documents of each cluster from
the entire collection. Specifically, it tries to minimize the cosine
between the centroid vector of each cluster and the centroid vector
of the entire collection. The contribution of each cluster is weighted
proportionally to its size so that larger clusters will be weighted


5.  SOFT CLUSTERING CRITERION FUNCTIONS

A natural and straightforward way of deriving soft clustering solutions
is to assign each document to multiple clusters. This is usually
achieved by using membership functions [29, 17, 3, 1] that,
for each document $d_i$ and cluster $S_j$, compute a non-negative
weight, denoted by $\mu_{i,j}$, such that $\sum_j \mu_{i,j} = 1$, which indicates
the extent to which document $d_i$ belongs to cluster $S_j$. Thus, we
can think of the various $\mu_{i,j}$ values as the fraction by which $d_i$
belongs to cluster $S_j$. Note that in the case of a hard clustering
solution, for each document $d_i$ one of these $\mu_{i,j}$ values is one (the
one corresponding to the cluster that $d_i$ belongs to) and the rest are
zero.

Using  these  membership  functions,  the  soft  clustering  extensions  of 
the  hard  criterion  functions  described  in  Section  4  can  be  derived 
as  follows. 


Soft $\mathcal{I}_1$ Criterion Function. Since each cluster can now contain
fractions of all the documents, a natural way of measuring the
overall pairwise similarity between the documents assigned to each
cluster is to take into account their membership functions. Specifically,
we can compute the pairwise similarity between the (fractional)
documents assigned to the $r$th soft cluster as

$$\sum_{i,j} \mu_{i,r} \, \mu_{j,r} \cos(d_i, d_j).$$


Similarly, we can compute the soft size $\bar{n}_r$ of the $r$th soft cluster as
$\bar{n}_r = \sum_i \mu_{i,r}$. Using these definitions, the soft $\mathcal{I}_1$ criterion
function, denoted by $S\mathcal{I}_1$, is defined as follows:


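$$\text{maximize} \quad S\mathcal{I}_1 = \sum_{r=1}^{k} \bar{n}_r \left( \frac{\sum_{i,j} \mu_{i,r} \, \mu_{j,r} \cos(d_i, d_j)}{\bar{n}_r^2} \right). \tag{5}$$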


Soft $\mathcal{I}_2$ Criterion Function. A soft version of the $\mathcal{I}_2$ criterion
function can be obtained by extending the notion of the cluster's
centroid vector to soft clusters. Since each soft cluster contains
fractions of documents, its centroid vector should also be calculated
based on the fractional documents that it contains. Specifically, we
can define the soft centroid vector of the $r$th soft cluster $\bar{C}_r$ as

$$\bar{C}_r = \frac{\sum_{i=1}^{N} \mu_{i,r} \, d_i}{\bar{n}_r},$$


which takes into account the fractional membership of each document
and its soft size. Using the above definition, the soft $\mathcal{I}_2$ criterion
function, denoted by $S\mathcal{I}_2$, can be obtained by requiring the
clustering solution to maximize the similarity between the (fractional)
documents assigned to a soft cluster and its centroid. This
is formally defined as follows:


$$\text{maximize} \quad S\mathcal{I}_2 = \sum_{r=1}^{k} \sum_{i=1}^{N} \mu_{i,r} \cos(d_i, \bar{C}_r). \tag{6}$$


Soft $\mathcal{E}_1$ Criterion Function. The $\mathcal{E}_1$ criterion function tries to
separate the centroid of each cluster from that of the entire collection
and weights each cluster by its size. Thus, the new element being
introduced in trying to develop a soft version of the $\mathcal{E}_1$ criterion
function (over those introduced by $\mathcal{I}_1$ and $\mathcal{I}_2$) is the notion of the
collection centroid. In our soft formulation of $\mathcal{E}_1$, we compute this
centroid by treating the entire collection as one soft cluster. In this
case, the value of the membership function for each document to
this cluster is one, and as a result, the soft collection centroid is the
same as that for hard clustering; that is, $\bar{C} = C = \sum_{i=1}^{N} d_i / N$.
Given this definition and the earlier definitions of soft cluster centroid
and soft cluster size, the soft $\mathcal{E}_1$ criterion function, denoted by
$S\mathcal{E}_1$, is defined as follows:

$$\text{minimize} \quad S\mathcal{E}_1 = \sum_{r=1}^{k} \bar{n}_r \cos(\bar{C}_r, C). \tag{7}$$


Soft $\mathcal{G}_1$ Criterion Function. In order to develop a soft version
of the $\mathcal{G}_1$ criterion function we need to properly define (i) the edge-cut
between a cluster and the rest of the documents in the collection,
and (ii) the sum of the weights of the edges between the documents
in each cluster. Since the weights of the edges between each pair
of documents correspond to the cosine similarity between their
respective document vectors, both of the above quantities can be
easily obtained by extending the expressions in Equation 4 to take
into account the membership functions. Specifically, we can compute
the soft version of the sum of the weights of the edges between
the documents of the $r$th cluster (the denominator of Equation 4) as
$\sum_{i,j} \mu_{i,r} \, \mu_{j,r} \cos(d_i, d_j)$. Similarly, since the $r$th cluster contains
fractions of all the documents, the fractions of the documents that
are not assigned to this cluster are the fractions that belong to the
cluster corresponding to the "rest" of the documents. As a result,
we can compute the edge-cut between the $r$th cluster and the rest of
the documents in the collection as $\sum_{i,j} \mu_{i,r} (1 - \mu_{j,r}) \cos(d_i, d_j)$.
Using these definitions, the soft version of the $\mathcal{G}_1$ criterion function,
denoted by $S\mathcal{G}_1$, is defined as follows:


$$\text{minimize} \quad S\mathcal{G}_1 = \sum_{r=1}^{k} \frac{\sum_{i,j} \mu_{i,r} (1 - \mu_{j,r}) \cos(d_i, d_j)}{\sum_{i,j} \mu_{i,r} \, \mu_{j,r} \cos(d_i, d_j)}. \tag{8}$$
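The four soft criterion functions of Equations 5 to 8 can be evaluated together once the membership values are known. The sketch below is only an illustration: it assumes unit-length documents, a membership matrix `mu[i][r]` whose rows sum to one, and pairwise sums that include the pairs with $i = j$, which lets every double sum be written as a dot product of soft composite vectors. The function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def soft_composite(docs, mu, r):
    """sum_i mu[i][r] * d_i; dividing by the soft size gives the soft centroid."""
    dim = len(docs[0])
    return [sum(mu[i][r] * d[j] for i, d in enumerate(docs)) for j in range(dim)]

def soft_criteria(docs, mu):
    """Return (SI1, SI2, SE1, SG1) for unit-length docs and memberships mu[i][r]."""
    k = len(mu[0])
    D = [sum(col) for col in zip(*docs)]           # collection composite vector
    si1 = si2 = se1 = sg1 = 0.0
    for r in range(k):
        Dr = soft_composite(docs, mu, r)
        nr = sum(row[r] for row in mu)             # soft size of cluster r
        norm_r = math.sqrt(dot(Dr, Dr))
        si1 += dot(Dr, Dr) / nr                    # Equation 5
        si2 += sum(mu[i][r] * dot(d, Dr) / norm_r  # Equation 6
                   for i, d in enumerate(docs))
        se1 += nr * dot(Dr, D) / (norm_r * math.sqrt(dot(D, D)))      # Equation 7
        sg1 += dot(Dr, [a - b for a, b in zip(D, Dr)]) / dot(Dr, Dr)  # Equation 8
    return si1, si2, se1, sg1

x, y = [1.0, 0.0], [0.0, 1.0]
docs = [x, x, y, y]
hard = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]   # crisp memberships
half = [[0.5, 0.5]] * 4                                   # fully soft memberships
```

With the crisp matrix `hard`, the four values coincide with the hard criterion functions of Section 4 for the partition {x, x}, {y, y}, which is a useful sanity check on the soft definitions.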


6.  SOFT PARTITIONAL CLUSTERING ALGORITHM


Our focus thus far has been on developing soft-clustering extensions
for four different criterion functions that are used to obtain
hard-clustering solutions. We now turn our attention to developing
algorithms that compute clustering solutions that optimize each of
these criterion functions.

Traditionally, soft clustering algorithms are derived by analytically
optimizing their respective criterion functions using Lagrange multipliers
(e.g., the fuzzy C-means algorithm [3]). This analytical approach
leads directly to an iterative strategy that determines the
values of the various membership functions by which the overall
criterion function is optimized. Even though this approach can be
applied to optimize the $S\mathcal{I}_2$ criterion function [29], analytically
deriving such optimization iterations for the $S\mathcal{I}_1$, $S\mathcal{E}_1$, and $S\mathcal{G}_1$
functions is hard if not impossible. For this reason, we developed
a soft partitional clustering algorithm that determines the values
of the membership functions of the various documents following
the induced fuzzy partitioning approach [1], and optimizes the soft
criterion functions using a hard-clustering based optimization approach.


6.1  Determining  the  Membership  Functions 

Given a hard $k$-way clustering solution $\{S_1, S_2, \ldots, S_k\}$, we define
the membership of document $d_i$ to cluster $S_j$ to be

$$\mu_{i,j} = \frac{\cos(d_i, C_j)^m}{\sum_{r=1}^{k} \cos(d_i, C_r)^m}, \tag{9}$$

where $C_r$ is the centroid of the hard cluster $S_r$.


The parameter $m$ in Equation 9 is the fuzzy factor and controls the
"softness" of the membership function and hence the "softness"
of the clustering solution (the inclusion of the fuzzy factor was
motivated by the formulation of the fuzzy C-means algorithm).
When $m$ is equal to zero, the membership values of a document
to each cluster are the same (i.e., there is no preference for a particular
cluster). On the other hand, as $m$ approaches infinity, the soft
membership function becomes the hard membership function (i.e.,
$\mu_{i,j} = 1$ if $d_i$ is closest to $S_j$; $\mu_{i,j} = 0$ otherwise). In general,
the softness of the clustering solution increases as the value of $m$
decreases and vice versa.
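The effect of the fuzzy factor in Equation 9 is easy to see numerically. The following sketch assumes that all cosines are non-negative and not all zero, which holds for tf-idf vectors that share at least one term with some centroid; the function name `memberships` is illustrative.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))

def memberships(doc, centroids, m):
    """Equation 9: mu_j = cos(d, C_j)^m / sum_r cos(d, C_r)^m."""
    powered = [cosine(doc, c) ** m for c in centroids]
    total = sum(powered)
    return [p / total for p in powered]

d = [0.8, 0.6]                       # cosines 0.8 and 0.6 with the two centroids
cents = [[1.0, 0.0], [0.0, 1.0]]
for m in (0, 1, 4, 50):
    print(m, [round(v, 4) for v in memberships(d, cents, m)])
```

For `m = 0` both memberships are equal, and as `m` grows the membership of the closer centroid approaches one, matching the limiting behavior described above.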


6.2  Determining  the  Clusters 

As we mentioned in Section 3, a hard-clustering based optimization
approach results in a pair of clustering solutions: a hard clustering
solution and the induced soft clustering solution. In this paper, we
focus on the hard clustering solution and use a clustering approach
that determines the overall $k$-way clustering solution by performing
a sequence of cluster bisections. In this approach, a $k$-way solution
is obtained by first bisecting the entire collection. Then, one of the
two clusters is selected and it is further bisected, leading to a total
of three clusters. The process of selecting and bisecting a particular
cluster continues until $k$ clusters are obtained. This repeated
bisectioning approach was motivated by two reasons. First, recent
studies on hard partitional clustering [37, 46] have shown that
such an approach leads to better clustering solutions than the traditional
approach that computes the $k$-way solution directly. Second,
it leads to an algorithm that has a lower computational complexity
(in most cases it is faster by an order of $k$).

Each of these bisections is performed in two steps. During the first
step, an initial clustering solution is obtained by randomly assigning
the documents to two clusters. During the second step, the initial
clustering is repeatedly refined so that it optimizes the desired
clustering criterion function.

The refinement strategy consists of a number of iterations. During
each iteration, the documents are visited in a random order. For
each document, $d$, we compute the change in the value of the soft
criterion function obtained by moving $d$ to the other cluster. This is
done by deriving the membership values for the original and modified
hard clustering solution (i.e., assuming that $d$ moved to the
other cluster) and then calculating the change of the soft criterion
function. If the change improves the criterion function, then $d$ is
moved to the other cluster. Otherwise, $d$ remains in the cluster
that it already belongs to. The refinement phase ends as soon
as we perform an iteration in which no documents moved between
clusters. The detailed pseudo-code is given in Algorithm 6.1.


Algorithm 6.1: Soft2WayRefine(S1, S2)

  C1 ← centroid of S1
  C2 ← centroid of S2
  μ_{i,j} ← membership values using C1 and C2
  F ← fuzzy criterion function value
  while movements are made
    do for each d ∈ S1 ∪ S2
      do (S1', S2') ← 2-way clustering after moving d
         C1' ← centroid of S1'
         C2' ← centroid of S2'
         μ'_{i,j} ← membership values using C1' and C2'
         F' ← fuzzy criterion function value
         if F' is better than F
           then S1 ← S1'
                S2 ← S2'
                F ← F'
                μ_{i,j} ← μ'_{i,j}
  return (S1, S2)
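Algorithm 6.1 can be sketched in Python as follows. This is an illustrative reading of the pseudo-code, not the authors' implementation. It fixes the criterion to the soft $\mathcal{I}_2$ function (Equation 6), written as the sum of the norms of the soft composite vectors, and it makes three assumptions that are not in the paper: a move that would empty a cluster is skipped, a document orthogonal to both centroids receives equal memberships, and all cosines are non-negative. The names `soft_i2` and `soft_2way_refine` are hypothetical.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def composite(vectors):
    return [sum(col) for col in zip(*vectors)]

def soft_i2(docs, labels, m):
    """Soft I2 value induced by a hard 2-way labeling (Equations 6 and 9)."""
    cents = [composite([d for d, l in zip(docs, labels) if l == r]) for r in (0, 1)]
    norms = [math.sqrt(dot(c, c)) for c in cents]
    mu = []
    for d in docs:
        # Equation 9; for unit-length d, cos(d, C_r) = d . D_r / ||D_r||
        p = [(dot(d, c) / n) ** m for c, n in zip(cents, norms)]
        total = sum(p)
        mu.append([x / total for x in p] if total > 0 else [0.5, 0.5])
    value = 0.0
    for r in (0, 1):
        Dr = composite([[mu[i][r] * x for x in d] for i, d in enumerate(docs)])
        value += math.sqrt(dot(Dr, Dr))     # sum_i mu[i][r] cos(d_i, soft centroid)
    return value

def soft_2way_refine(docs, labels, m=2.0, seed=0):
    rng = random.Random(seed)
    labels = list(labels)
    best = soft_i2(docs, labels, m)
    moved = True
    while moved:                             # stop after a pass with no moves
        moved = False
        order = list(range(len(docs)))
        rng.shuffle(order)                   # visit documents in random order
        for i in order:
            if labels.count(labels[i]) == 1:
                continue                     # assumption: never empty a cluster
            labels[i] = 1 - labels[i]        # tentatively move d to the other cluster
            value = soft_i2(docs, labels, m)
            if value > best + 1e-12:         # soft I2 is maximized
                best, moved = value, True    # keep the move immediately
            else:
                labels[i] = 1 - labels[i]    # undo the move
    return labels, best

docs = [[1.0, 0.0], [0.8, 0.6], [0.6, 0.8], [0.0, 1.0]]
labels, value = soft_2way_refine(docs, [0, 1, 0, 1])
```

Recomputing the criterion from scratch for every tentative move keeps the sketch short; an efficient implementation would update the centroids and memberships incrementally.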


Note that unlike the traditional refinement approach used by K-means
type algorithms, the above algorithm moves a document
as soon as it is determined that it will lead to an improvement in
the value of the criterion function. This type of refinement algorithm
is often called incremental [14]. Since each move directly
optimizes the particular criterion function, this refinement strategy
always converges to a local optimum.

The greedy nature of the refinement algorithm does not guarantee
that it will converge to a global optimum, and the locally optimal solution
it obtains depends on the particular set of seed documents that
were selected during the initial clustering. To eliminate some of this
sensitivity, the overall process is repeated a number of times. That
is, we compute $N$ different clustering solutions (i.e., initial clustering
followed by cluster refinement), and the one that achieves
the best value for the particular criterion function is kept. In all of
our experiments, we used $N = 10$. For the rest of this discussion,
when we refer to the clustering solution we will mean the solution
that was obtained by selecting the best out of these $N$ potentially
different solutions.

6.2.1  Cluster  Selection 

A key step in this repeated bisection algorithm is the method used
to select which cluster to bisect next. In our experiments, we used a
simple strategy of bisecting the largest cluster available at that point
of the clustering solution. Our earlier experience with this approach
showed that it leads to reasonably good and balanced clustering
solutions [37, 45].

We also used a strategy to stop bisecting a cluster. Specifically, if
after the bisection, one of the two resulting clusters contains fewer
than 5% of all the documents, we consider that the cluster is not
separable according to the criterion function. In such cases, we
keep the cluster as it is and do not select it for further bisections.
Thus, the number of clusters returned by our algorithm could be
smaller than the number of clusters requested as input (if all the
resulting clusters meet the stop condition).
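The cluster selection and stopping rules above can be sketched as a small driver loop. The bisection step itself is passed in as a caller-supplied function `bisect_fn`, which is a hypothetical stand-in for the refinement procedure of this section. The sketch also assumes that "5% of all the documents" refers to the whole collection rather than to the cluster being split.

```python
def repeated_bisection(docs, k, bisect_fn, min_fraction=0.05):
    """Split the largest still-separable cluster until k clusters exist.

    bisect_fn(cluster) must return two lists that partition the cluster.
    """
    clusters = [list(docs)]
    frozen = []                           # clusters judged not separable
    while len(clusters) + len(frozen) < k and clusters:
        clusters.sort(key=len)
        largest = clusters.pop()          # largest remaining cluster
        left, right = bisect_fn(largest)
        if min(len(left), len(right)) < min_fraction * len(docs):
            frozen.append(largest)        # keep it whole, never select it again
        else:
            clusters.extend([left, right])
    return clusters + frozen

def halves(cluster):
    # Toy bisection used only for the demonstration below.
    mid = len(cluster) // 2
    return cluster[:mid], cluster[mid:]

def peel_one(cluster):
    # Degenerate bisection that always splits off a single document.
    return cluster[:1], cluster[1:]

balanced = repeated_bisection(list(range(40)), 4, halves)
stopped = repeated_bisection(list(range(40)), 4, peel_one)
```

With the toy `halves` function the driver returns four clusters of ten documents each, while `peel_one` triggers the stop condition immediately, so fewer clusters than requested are returned.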

6.2.2  Computational  Complexity 

Each iteration of the refinement of a 2-way clustering of a set of
documents requires the examination of the movement of each one
of the documents to the other cluster. During this process, the most
expensive computation is the calculation of the membership values,
which need to be updated for all the documents. Thus, the time
complexity of each iteration is $O(n^2)$. If we assume that each successive
bisection splits the documents into two roughly equal-size
clusters and that we follow a largest-cluster selection strategy, then
the overall amount of time required to compute all $k - 1$ bisections
is $O(n^2)$.

7.  EXPERIMENTAL  RESULTS 

We experimentally evaluated the performance of the various soft criterion functions and compared them with the corresponding hard criterion functions using a number of different datasets. In the rest of this section we first describe the various datasets and our experimental methodology, followed by a description of the experimental results.

7.1  Document  Collections 

In our experiments, we used a total of twelve different datasets, whose general characteristics are summarized in Table 1. The smallest of these datasets contained 356 documents and the largest contained 1,170 documents. To ensure diversity in the datasets, we obtained them from different sources. For all datasets, we used a stop-list to remove common words, and the words were stemmed using Porter's suffix-stripping algorithm [33]. Moreover, any term that occurs in fewer than two documents was eliminated.

The hitech and sports datasets were derived from the San Jose Mercury newspaper articles that are distributed as part of the TREC collection (TIPSTER Vol. 3). Each of these datasets was constructed by selecting documents that are part of certain topics in which the various articles were categorized (based on the DESCRIPT tag). The hitech dataset contained documents about computers, electronics, health, medical, research, and technology; and the sports dataset contained documents about baseball, basketball, bicycling, boxing, football, golfing, and hockey. The datasets k1a, k1b, and wap are from the WebACE project [31, 18, 5, 6]. Each document corresponds to a web page listed in the subject hierarchy of Yahoo! [40]. The datasets k1a and k1b contain exactly the same set of documents but they differ in how the documents were assigned to different classes. In particular, k1a contains a finer-grain categorization than that contained in k1b. The fbis dataset is from the Foreign Broadcast Information Service data of TREC-5 [39], and the classes correspond to the categorization used in that collection. The la1 and la2 datasets were obtained from articles of the Los Angeles Times that were used in TREC-5 [39]. The categories correspond to the desk of the paper in which each article appeared


Table 1: Summary of data sets used to evaluate the various clustering criterion functions.

Data      Source                   # of documents  # of terms  # of classes
hitech    San Jose Mercury (TREC)  767             7599        6
sports    San Jose Mercury (TREC)  858             7163        7
reuters1  Reuters-21578            908             10582       3
odp       Open Directory Project   356             551         3
inspec1   Scientific Database      920             11803       3
wap       WebACE                   780             7131        20
k1a       WebACE                   1170            9527        20
k1b       WebACE                   1170            9781        6
fbis      FBIS (TREC)              821             1997        17
la1       LA Times (TREC)          801             8449        6
la2       LA Times (TREC)          769             8333        6
re1       Reuters-21578            829             3221        25

and include documents from the entertainment, financial, foreign, metro, national, and sports desks. The dataset re1 is from the Reuters-21578 text categorization test collection Distribution 1.0 [27]. The reuters1, odp, and inspec1 datasets are the datasets used by [29]. Refer to [29] for detailed information about these datasets. In summary, reuters1 was derived from the Reuters-21578 collection and contains documents from three classes: trade, acq, and earn. The odp dataset was derived from the Open Directory Project and consists of three classes: drugs, health, and sport. Finally, the inspec1 dataset was derived from a scientific database, and its documents are on the topics of back-propagation, fuzzy control, and pattern classification. Note that none of the datasets used in our study contain documents with multiple class labels. The original odp and inspec1 datasets in [29] contain some documents with multiple class labels. For the purpose of our study, we selected only those documents that belong to exactly one class.

7.2  Experimental  Methodology  and  Metrics 

For each one of the different datasets we obtained a 10-way and a 20-way clustering solution that optimized the various hard and soft clustering criterion functions (Equations 1-8). Specifically, for each hard criterion function, we compared it with the corresponding soft criterion functions with different values of the fuzzy factor $m$. The quality of a clustering solution was evaluated using the entropy measure, which is based on how the various classes of documents are distributed within each cluster.

Given a particular cluster $S_r$ of size $n_r$, the entropy of this cluster is defined to be

$$E(S_r) = -\frac{1}{\log q} \sum_{i=1}^{q} \frac{n_r^i}{n_r} \log \frac{n_r^i}{n_r},$$

where $q$ is the number of classes in the dataset and $n_r^i$ is the number of documents of the $i$th class that were assigned to the $r$th cluster. The entropy of the entire solution is defined to be the sum of the individual cluster entropies weighted according to the cluster size, i.e.,

$$\text{Entropy} = \sum_{r=1}^{k} \frac{n_r}{n} E(S_r).$$

A  perfect  clustering  solution  will  be  the  one  that  leads  to  clusters 
that  contain  documents  from  only  a  single  class,  in  which  case  the 
entropy  will  be  zero.  In  general,  the  smaller  the  entropy  values,  the 
better  the  clustering  solution  is. 
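As an illustration, the entropy measure can be computed from a cluster-by-class contingency table as in the following sketch. It assumes the normalized form of the cluster entropy (division by $\log q$, so that values lie in $[0, 1]$); the function names are ours and are not taken from any library.

```python
import math
from typing import Sequence


def cluster_entropy(class_counts: Sequence[int], q: int) -> float:
    """Entropy of one cluster, normalized by log(q) (assumed normalization)."""
    n_r = sum(class_counts)
    if n_r == 0 or q < 2:
        return 0.0
    h = 0.0
    for count in class_counts:
        if count > 0:  # the term for an empty class is taken as 0
            p = count / n_r
            h -= p * math.log(p)
    return h / math.log(q)


def solution_entropy(contingency: Sequence[Sequence[int]]) -> float:
    """Size-weighted sum of the cluster entropies.

    contingency[r][i] is the number of documents of class i in cluster r.
    """
    q = len(contingency[0])
    n = sum(sum(row) for row in contingency)
    return sum(sum(row) / n * cluster_entropy(row, q) for row in contingency)
```

Applied to the per-cluster class counts (trad, acq, earn) listed for the soft solution in Table 4, this computation should give a value close to the 0.194 reported there, which is a useful sanity check on the normalization.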

To eliminate any instances in which a particular clustering solution for a particular criterion function got trapped in a bad local optimum, in all of our experiments we found ten different clustering solutions. As discussed in Section 6.2, each of these ten clustering solutions corresponds to the best solution (in terms of the respective criterion function) out of ten different initial partitioning and refinement phases.

7.3  Comparison  of  the  Hard  and  Soft  Criterion  Functions

Our experiments were focused on evaluating the quality of the clustering solutions produced by the various hard and soft criterion functions when they were used to compute a $k$-way clustering solution via repeated bisections. The clustering solutions of the various hard criterion functions were obtained by using CLUTO [23]. The results for the various datasets and criterion functions for 10- and 20-way clustering solutions are shown in Tables 2 and 3, respectively.

The results for each dataset are shown in a separate subtable, in which each column corresponds to one of the four criterion functions. The results of the soft criterion functions with various fuzzy factor values are shown in the first five rows (labeled by the fuzzy factor values), and those of the various hard criterion functions are shown in the last row. The entries that are boldfaced correspond to the best values obtained for each column (i.e., for each criterion function, the best value among the hard and the various soft criterion functions with different $m$ values for each dataset), whereas the underlined entries correspond to the best values obtained among all the criterion functions for each dataset.

A number of observations can be made by analyzing the results in Table 2. First, for most datasets, introducing softness into each one of the four criterion functions improved the quality of the clustering solutions, and different trends can be observed in the relative improvement for different criterion functions. The $S\mathcal{I}_1$ criterion function outperformed $\mathcal{I}_1$ on eight datasets, among which the relative improvements were greater than 10% for six datasets, with the largest improvement being 23%. The effect of introducing softness was less significant, but more consistent, for both $S\mathcal{I}_2$ and $S\mathcal{G}_1$ than for $S\mathcal{I}_1$. For only three datasets, $S\mathcal{I}_2$ and $S\mathcal{G}_1$ performed better than $\mathcal{I}_2$ and $\mathcal{G}_1$, respectively, by more than 10%. However, $S\mathcal{I}_2$ and $S\mathcal{G}_1$ outperformed $\mathcal{I}_2$ and $\mathcal{G}_1$ on ten and nine datasets, respectively. $S\mathcal{E}_1$ benefits the least, with improvements observed on seven datasets and improvements greater than 10% observed on two datasets. Second, the fuzzy factor values that achieved the best clustering solutions seemed to vary for different datasets, which suggests that the proper fuzzy factor values may relate to some characteristics of the datasets and their class conformations. Finally, $S\mathcal{G}_1$ was less sensitive to the choice of fuzzy factor values

Table 2: The entropies of the clustering solutions obtained by hard and soft criterion functions with various fuzzy factors.

reuters1, 10-way
Method  $\mathcal{I}_1$  $\mathcal{I}_2$  $\mathcal{E}_1$  $\mathcal{G}_1$
m = 1   0.228            0.229            0.287            0.283
m = 2   0.205            0.194            0.232            0.222
m = 4   0.194            0.210            0.226            0.201
m = 6   0.264            0.231            0.298            0.214
m = 8   0.359            0.296            0.370            0.255
hard    0.250            0.218            0.262            0.248

sports, 10-way
Method  $\mathcal{I}_1$  $\mathcal{I}_2$  $\mathcal{E}_1$  $\mathcal{G}_1$
m = 1   0.431            0.245            0.208            0.200
m = 2   0.497            0.248            0.219            0.161
m = 4   0.270            0.250            0.119            0.170
m = 6   0.274            0.177            0.106            0.156
m = 8   0.294            0.204            0.133            0.161
hard    0.252            0.181            0.158            0.185

hitech, 10-way
Method  $\mathcal{I}_1$  $\mathcal{I}_2$  $\mathcal{E}_1$  $\mathcal{G}_1$
m = 1   0.757            0.644            0.674            0.584
m = 2   0.627            0.639            0.618            0.596
m = 4   0.616            0.612            0.615            0.599
m = 6   0.611            0.586            0.586            0.603
m = 8   0.618            0.594            0.582            0.587
hard    0.644            0.610            0.573            0.585

wap, 10-way
Method  $\mathcal{I}_1$  $\mathcal{I}_2$  $\mathcal{E}_1$  $\mathcal{G}_1$
m = 1   0.586            0.540            0.513            0.386
m = 2   0.570            0.412            0.411            0.387
m = 4   0.595            0.385            0.399            0.376
m = 6   0.456            0.415            0.398            0.371
m = 8   0.454            0.443            0.412            0.409
hard    0.421            0.414            0.408            0.381

inspec1, 10-way
Method  $\mathcal{I}_1$  $\mathcal{I}_2$  $\mathcal{E}_1$  $\mathcal{G}_1$
m = 1   0.326            0.324            0.330            0.298
m = 2   0.365            0.303            0.297            0.303
m = 4   0.469            0.297            0.300            0.295
m = 6   0.390            0.306            0.286            0.300
m = 8   0.400            0.302            0.320            0.297
hard    0.441            0.293            0.283            0.290

odp, 10-way
Method  $\mathcal{I}_1$  $\mathcal{I}_2$  $\mathcal{E}_1$  $\mathcal{G}_1$
m = 1   0.236            0.210            0.216            0.224
m = 2   0.277            0.246            0.218            0.247
m = 4   0.326            0.289            0.309            0.282
m = 6   0.379            0.327            0.335            0.308
m = 8   0.421            0.308            0.355            0.346
hard    0.293            0.283            0.233            0.256

k1a, 10-way
Method  $\mathcal{I}_1$  $\mathcal{I}_2$  $\mathcal{E}_1$  $\mathcal{G}_1$
m = 1   0.601            0.585            0.530            0.432
m = 2   0.573            0.519            0.437            0.413
m = 4   0.584            0.418            0.407            0.443
m = 6   0.448            0.429            0.406            0.418
m = 8   0.487            0.462            0.436            0.435
hard    0.460            0.434            0.410            0.419

fbis, 10-way
Method  $\mathcal{I}_1$  $\mathcal{I}_2$  $\mathcal{E}_1$  $\mathcal{G}_1$
m = 1   0.525            0.527            0.510            0.416
m = 2   0.529            0.429            0.424            0.387
m = 4   0.375            0.398            0.404            0.372
m = 6   0.379            0.380            0.394            0.378
m = 8   0.399            0.387            0.430            0.393
hard    0.396            0.396            0.398            0.382

k1b, 10-way
Method  $\mathcal{I}_1$  $\mathcal{I}_2$  $\mathcal{E}_1$  $\mathcal{G}_1$
m = 1   0.240            0.223            0.200            0.160
m = 2   0.347            0.205            0.181            0.112
m = 4   0.259            0.153            0.125            0.157
m = 6   0.238            0.174            0.148            0.113
m = 8   0.249            0.226            0.194            0.189
hard    0.169            0.154            0.153            0.133

re1, 10-way
Method  $\mathcal{I}_1$  $\mathcal{I}_2$  $\mathcal{E}_1$  $\mathcal{G}_1$
m = 1   0.462            0.500            0.489            0.422
m = 2   0.473            0.451            0.417            0.417
m = 4   0.424            0.386            0.394            0.415
m = 6   0.424            0.409            0.396            0.397
m = 8   0.442            0.393            0.405            0.412
hard    0.414            0.397            0.396            0.404

la2, 10-way
Method  $\mathcal{I}_1$  $\mathcal{I}_2$  $\mathcal{E}_1$  $\mathcal{G}_1$
m = 1   0.832            0.808            0.788            0.408
m = 2   0.841            0.401            0.393            0.377
m = 4   0.381            0.388            0.363            0.380
m = 6   0.408            0.358            0.380            0.369
m = 8   0.357            0.374            0.418            0.417
hard    0.457            0.400            0.338            0.367

la1, 10-way
Method  $\mathcal{I}_1$  $\mathcal{I}_2$  $\mathcal{E}_1$  $\mathcal{G}_1$
m = 1   0.784            0.747            0.714            0.486
m = 2   0.756            0.559            0.463            0.428
m = 4   0.687            0.459            0.456            0.419
m = 6   0.459            0.423            0.469            0.422
m = 8   0.465            0.448            0.467            0.421
hard    0.519            0.423            0.413            0.418

than the other three criterion functions. Similar trends can be observed from Table 3 as well.

Note that for the fbis and re1 datasets, the results of the 10- and 20-way clustering solutions are the same for the $\mathcal{I}_1$ and the various $S\mathcal{I}_1$ criterion functions. That is because we employed a strategy to stop bisecting a cluster, as described in Section 6.2.1. The actual number of clusters obtained by the $\mathcal{I}_1$ and the various $S\mathcal{I}_1$ criterion functions is smaller than ten for both fbis and re1.

8.  DISCUSSION 

The experimental results presented in the previous section suggest that for most datasets soft criterion functions can improve the quality of the clustering solutions. In this section, we look at the clustering solutions obtained by soft criterion functions more carefully and identify some of the different characteristics observed in clustering solutions obtained by hard and soft criterion functions.

8.1  The  effect  of  fuzzy  factor 

We first look at how the fuzzy factor affects the moves made by the various soft criterion functions. Recall that the fuzzy factor controls the "softness" of the membership function and hence the "softness" of the clustering solutions. As $m$ increases, the soft membership function becomes the hard membership function (i.e., $\mu_{i,j} = 1$ if $d_i$ is closest to $S_j$, and $\mu_{i,j} = 0$ otherwise), and consequently the soft criterion functions become the hard criterion functions. Hence, for every move made by the soft criterion functions, if we also compute the gain of the corresponding hard criterion function, we would expect that the agreement between the soft and hard criterion functions will increase as $m$ increases.
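The limiting behavior can be illustrated with a small numerical sketch. The membership form used here, each similarity raised to the power $m$ and normalized over the clusters, is an assumption made purely for illustration; the membership function actually used is the one defined earlier in the paper and may differ in detail. The function name is hypothetical.

```python
from typing import List, Sequence


def soft_memberships(sims: Sequence[float], m: float) -> List[float]:
    """Assumed membership: sim_j ** m, normalized to sum to 1 over clusters.

    `sims` holds the (non-negative, not all zero) similarities of one
    document to each cluster.
    """
    powered = [s ** m for s in sims]
    total = sum(powered)
    return [p / total for p in powered]


# A document slightly closer to the first of two clusters.
sims = [0.6, 0.4]
for m in (1, 2, 4, 8):
    print(m, [round(mu, 3) for mu in soft_memberships(sims, m)])
```

Under this assumed form, the membership of the closest cluster grows toward 1 as $m$ increases, which is the sense in which the soft membership function approaches the hard one.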

Figure 1(a) shows the average percentages of the moves that were made by the various soft criterion functions and that the corresponding hard criterion function agreed with. The percentage values were averaged over all the datasets. As shown in Figure 1(a), as $m$ increases, we do see a trend of increasing agreement between the soft and the corresponding hard criterion functions for $S\mathcal{I}_1$, $S\mathcal{I}_2$, and $S\mathcal{E}_1$.

One interesting observation is that even though the degree to which the hard and soft criterion functions agree increases with increasing $m$, it does not reach very high values (i.e., it does not approach 100%). This is true even for large values of $m$ (not shown in the graph). The reason is that the hard clustering solution induced by the soft clustering algorithm assigns each document to the cluster for which it has the highest membership value. However, this cluster may not necessarily be the one that optimizes


Table  3:  The  Entropies  of  the  clustering  solutions  obtained  by  hard  and  soft  criterion  functions  with  various  fuzzy  factors. 


renters’ 

20-way 

Xi 

I2 

£ 1 

Gi 

0.211 

0.233 

0.189 

im 

0.183 

0.211 

0.185 

0.200 

wm 

0.259 

B 

0.218 

0.198 

0.159 

0.206 

0.186 

sports  20-way 

Ti 

t2 

£1 

G 1 

0.187 

0.150 

0.175 

0.131 

0.133 

0.221 

0.125 

0.109 

0.177 

0.138 

0.122 

0.141 

j  hitech  20-way  j 

Method 

2i 

t2 

1  £1 

Gi 

m=  1 

0.746 

0.591 

mm 

0.513 

m  =  2 

0.586 

Era 

m  =  4 

0.551 

Hi 

m  =  6 

0.523 

Ega 

m  =  8 

0.532 

ini 

hard 

|  0.592 

0.553 

|  0.524 

|  0.541  | 

wap  20-way 

h 

t2 

£1 

G 1 

0.578 

0.523 

0.494 

0.324 

0.559 

0.321 

0.309 

0.585 

0.308 

0.283 

0.377 

0.311 

0.321 

0.384 

0.335 

0.325 

0.317 

0.326 

0.329 

0.319 

0.307 

j  odp  20-way  j 

Method 

T\ 

t2 

1  £1 

1  G 1  1 

m  =  1 

0.227 

0.153 

m  =  2 

0.234 

m  =  4 

0.295 

0.245 

0.219 

m  =  6 

0.277 

0.255 

0.257 

0.253 

m  =  8 

0.267 

0.268 

0.297 

hard 

|  0.281 

0.228 

0.201 

0.217 

Inspecl  20-way 

Ti 

t2 

£1 

Gi 

0.326 

0.324 

0.282 

0.363 

Era 

0.290 

0.469 

0.286 

Wm 

0.281 

0.374 

0.291 

Era 

0.281 

0.376 

0.290 

Era 

0.284 

0.427 

0.283 

0.270 

0.275 

kla  20-way 

Method 

Ti 

t2 

£1 

Gi 

m  =  1 

0.596 

0.579 

0.363 

m  =  2 

0.495 

0.343 

m  =  4 

0.340 

m  =  6 

0.339 

0.333 

0.344 

m  =  8 

0.364 

0.345 

0.361 

hard 

0.376 

0.347 

0.349 

0.334 

fbis  20-way 

Ti 

I2 

£1 

Gi 

0.469 

0.343 

Era 

0.374 

0.362 

0.311 

0.375 

0.321 

0.330 

0.379 

0.399 

0.319 

0.319 

0.396 

0.322 

0.329 

0.316 

klb  20-way 

Ti 

I2 

£1 

Gi 

0.226 

0.187 

0.125 

0.344 

0.181 

0.123 

0.096 

0.256 

0.107 

0.120 

0.139 

0.110 

0.125 

0.099 

0.152 

0.157 

0.154 

0.144 

0.155 

0.076 

0.105 

0.091 

la2  20- way 

Ti 

I2 

£1 

G 1 

0.832 

0.808 

0.788 

0.364 

0.841 

0.359 

0.348 

0.326 

0.341 

0.326 

0.334 

0.353 

0.313 

0.329 

0.319 

0.335 

0.333 

0.374 

0.355 

0.390 

0.333 

0.307 

0.334 

rel  20- way 

h 

t2 

£1 

G 1 

0.462 

0.434 

0.414 

0.473 

0.379 

0.424 

0.280 

0.424 

0.299 

0.442 

0.317 

0.326 

0.414 

0.299 

0.287 

0.300 

!  lal  20-way  j 

Method 

Tx 

t2 

;  £1 

G 1 

m  =  1 

0.783 

0.746 

0.426 

m  =  2 

0.747 

0.543 

m  =  4 

0.671 

0.414 

m  =  6 

0.410 

0.402 

0.391 

m  =  8 

EMI 

0.416 

0.404 

0.383 

hard 

1 0.447 

0.385 

0.379 

0.386 

the respective hard criterion function. Despite this fact, the trend that the agreement between the soft and the corresponding hard criterion functions increases as $m$ increases is still valid, as shown in Figure 1(a).

For $S\mathcal{G}_1$, the agreement between $S\mathcal{G}_1$ and $\mathcal{G}_1$ seems less sensitive to increases in the fuzzy factor value, which is consistent with the observation for $S\mathcal{G}_1$ from Tables 2 and 3.

8.2  Soft  criterion  functions  tend  to  make  moves 
more  consistent  with  cluster  separations. 

We also looked at the degree to which the movement of documents between clusters, as performed during the hard and soft criterion function optimizations, affects the inter-cluster separation. Specifically, every time a document is moved between two clusters (because such a move improves the overall value of the criterion function), we computed the cosine similarity between the cluster centroids before and after the move. Figure 1(b) shows the average percentages of the moves that also further separate the cluster centers for the various criterion functions (i.e., the cosine value between the two centroids decreased after the move). Again, the percentages were averaged over all the datasets. The last data point for each criterion function represents the average percentage for the hard criterion function. As shown in Figure 1(b), the moves made by hard criterion functions will not always increase the separation between cluster centers, whereas the soft criterion functions tend to make moves that are more consistent with cluster separations. For $S\mathcal{I}_1$, $S\mathcal{I}_2$, and $S\mathcal{E}_1$, more than 99.4% of the moves will separate cluster centers further when $m = 1$, and the percentage decreases as $m$ increases (i.e., as the "softness" of the clustering decreases). This property was also observed for fuzzy C-means [14].
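The separation test applied to each move can be sketched as follows. This is an illustrative reconstruction of the check described above, using dense vectors and unweighted mean centroids for simplicity; the helper names are ours, and the source cluster is assumed to keep at least one document after the move.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def centroid(docs: Sequence[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of document vectors."""
    dim = len(docs[0])
    return [sum(d[t] for d in docs) / len(docs) for t in range(dim)]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def move_separates(src: Sequence[Vector], dst: Sequence[Vector], idx: int) -> bool:
    """True if moving src[idx] into dst lowers the cosine between the centroids."""
    before = cosine(centroid(src), centroid(dst))
    new_src = [d for i, d in enumerate(src) if i != idx]
    new_dst = list(dst) + [src[idx]]
    after = cosine(centroid(new_src), centroid(new_dst))
    return after < before
```

Counting the fraction of accepted moves for which `move_separates` returns true, and averaging over datasets, gives the kind of percentage plotted in Figure 1(b).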

8.3  Soft  criterion  functions  tend  to  lead  to  less 
balanced  clustering  solutions. 

Another notable difference between the clustering solutions obtained by hard and soft criterion functions is that soft criterion functions tend to lead to less balanced clustering solutions, and the smaller the fuzzy factor value is, the less balanced the resulting clustering solution will be. Table 4 gives an example of 10-way clustering solutions obtained by $S\mathcal{I}_2$ and $\mathcal{I}_2$ for reuters1.

The reason that soft criterion functions tend to lead to less balanced solutions is that, since one document can now contribute to both clusters, the difference between the soft sizes of two clusters will be smaller than the difference between their hard sizes. Hence, soft criterion functions will tolerate clusters with larger differences in cluster size. Previous


Figure 1: (a) Average percentages of the moves made by the soft criterion functions that the corresponding hard criterion functions agreed with. (b) Average percentages of the moves that increase cluster separations.


Table 4: Comparison of 10-way clustering solutions obtained by $\mathcal{I}_2$ and $S\mathcal{I}_2$ for reuters1.

$S\mathcal{I}_2$ with m = 2 (Entropy = 0.194)
cid  Size  Sim     trad  acq  earn
0    58    +0.220  0     0    58
1    119   +0.220  0     0    119
2    34    +0.219  32    0    2
3    82    +0.188  22    48   12
4    62    +0.171  59    3    0
5    45    +0.090  0     5    40
6    139   +0.078  137   2    0
7    51    +0.074  0     46   5
8    152   +0.054  0     146  6
9    166   +0.035  1     160  5

$\mathcal{I}_2$ criterion (Entropy = 0.218)
cid  Size  Sim     trad  acq  earn
0    83    +0.185  24    48   11
1    67    +0.187  0     1    66
2    136   +0.186  0     2    134
3    76    +0.153  74    2    0
4    67    +0.118  66    0    1
5    88    +0.096  86    2    0
6    79    +0.076  0     75   4
7    85    +0.067  0     83   2
8    98    +0.049  0     71   27
9    129   +0.038  1     126  2

studies [46, 13] showed that highly unbalanced clusters will harm the quality of clustering solutions; hence the proper fuzzy factor should not be too small. Note that from the discussion in Section 8.2, we know that as the value of the fuzzy factor increases, a larger fraction of the moves will not lead to better cluster separations. Hence, the fuzzy factor value that leads to the best clustering solution has to achieve a balance between these two factors.
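A small numerical sketch may help make the soft-size argument concrete. It assumes that the soft size of a cluster is the sum of the membership values of all documents in that cluster, and that the hard size counts the documents whose largest membership falls in it; both the function and the toy memberships are hypothetical.

```python
from typing import List, Sequence, Tuple


def hard_and_soft_sizes(
    memberships: Sequence[Sequence[float]],
) -> Tuple[List[int], List[float]]:
    """Return (hard sizes, soft sizes) for per-document membership rows."""
    k = len(memberships[0])
    hard = [0] * k
    soft = [0.0] * k
    for mu in memberships:
        # Hard assignment: the cluster with the largest membership.
        hard[max(range(k), key=lambda j: mu[j])] += 1
        # Soft size (assumed definition): accumulate fractional memberships.
        for j in range(k):
            soft[j] += mu[j]
    return hard, soft


# Eight documents leaning toward cluster 0, two leaning toward cluster 1.
rows = [[0.7, 0.3]] * 8 + [[0.4, 0.6]] * 2
hard, soft = hard_and_soft_sizes(rows)
print(hard, [round(s, 1) for s in soft])
```

In this toy case the hard sizes differ by six documents while the soft sizes differ by less than three, so a criterion function computed on soft sizes sees a more balanced split than the hard assignment actually is.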

9.  CONCLUSION  AND  FUTURE  RESEARCH 

In this paper we extended four criterion functions that were studied in our previous work [46] to tackle the soft document clustering problem. We developed an approach similar to the induced fuzzy partition [1] to optimize various soft criterion functions. We presented a comprehensive experimental evaluation involving twelve different datasets and discussed the various trends observed in the experimental results. Our experimental results and analysis show that the soft criterion functions tend to consistently improve the separation between the clusters, and lead to better clustering results for most datasets.

We plan to extend the work in this paper along three directions. First, we plan to develop and evaluate soft clustering extensions for the remaining three criterion functions studied in [46], which include some of the schemes that optimize internal and external characteristics of clustering solutions. Second, we plan to expand our evaluation to determine the effectiveness of these soft criterion functions in producing overlapping clustering solutions. Third, we would like to further understand $S\mathcal{G}_1$ and the reason why it is less sensitive to the value of the fuzzy factor, as discussed in Section 8.